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Abstract 

Wire-exposed, programmable microarchitectures including Trips [11], Smart Mem- 
ories [8], and Raw [13] offer an opportunity to schedule instruction execution and data 
movement explicitly. This paper proposes stream algorithms, which, along with a 
decoupled systolic architecture, provide an excellent match for the physical and tech- 
nological constraints of single-chip tiled architectures. Stream algorithms enable pro- 
grammed systolic computations for different problem sizes, without incurring the cost 
of memory accesses. To that end, we decouple memory accesses from computation and 
move the memory accesses off the critical path. By structuring computations in sys- 
tolic phases, and deferring memory accesses to dedicated memory processors, stream 
algorithms can solve many regular problems with varying sizes on a constant-sized 
tiled array. Contrary to common sense, the compute efficiency of stream algorithms 
increases as we increase the number of processing elements. In particular, we show 
that the compute efficiency of stream algorithms can approach 100 % asymptotically, 
that is for large numbers of processors and appropriate problem size. 

1 Introduction 

Our goal is to show that microtechnology provides us with an opportunity to design single- 
chip parallel machines on which we may increase efficiency by increasing the number of 
processors for many important applications. Microtechnology is about to revolutionize the 
design of computer systems for the second time since the first single-chip microprocessor, 
Intel's 4004, was released in 1971. While the amount of transistors that fit onto a single chip 
has been growing steadily, only now, thirty years later, are we reaching the critical mass for 
realizing a general-purpose parallel microarchitecture on a single chip. Research prototypes 
such as Trips [11], Smart Memories [8], and Raw [13] represent the first steps into the design 
space of tiled architectures, which are single-chip parallel machines whose architecture is 
primarily determined by the propagation delay of signals across wires [4] . 

To enable high clock frequencies on large chip areas, tiled architectures have short wires 
that span a fraction of the side length of a chip, and use registers to pipeline the signal 
propagation. Short wires, in turn, introduce a scheduling problem in space and time to cope 
with the propagation of signals across distances longer than those reachable via a single 
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wire. Moving data across wires and distributing operations across processors are equally 
important scheduling goals. This scheduling problem has received attention in the context 
of VLSI design [10], parallel computation [6], and parallelizing compiler design [15] in the 
past. 

The speed advantage of short wires has not gone unnoticed. In fact, systolic arrays were 
proposed by Kung and Leiserson in the late 1970's [5], and aimed, in part, at exploiting the 
speed of short wires. Lacking the chip area to support programmable structures, however, 
early systolic arrays were designed as special-purpose circuits for a particular application, 
and were customized for a given problem size. Later systolic systems such as Warp [1] 
became programmable, so they could reap the benefits of systolic arrays without sacrificing 
fiexibility. We believe that the significant area and energy eflBciency of systolic arrays merit 
their reexamination in face of the architectural similarities to recent tiled microarchitectures. 

A commonly accepted and generally applicable technique to overcome the specialization 
of a systolic array to a particular problem size is the simulation of multiple processing 
elements on one larger, more powerful systolic processor. Such a processor uses local memory 
to store a potentially unbounded amount of data. Thus, the local memories required to 
run large problems must also be large. Large memories use a load/store interface to cope 
with intrinsically large access times. Load and store operations do not contribute useful 
computation, however, and constitute a burden on the critical path. In contrast, small 
memories are fast and can be organized as register sets. Register accesses are integrated 
as operand accesses into machine instructions where they do not eflFect the critical path 
adversely. Thus, it seems worthwhile to abandon the simulation technique and apply the 
systolic method to large problem sizes while using a small amount of local memory only. 




Figure 1: A decoupled systolic architecture (DSA) is an i? x i? array of compute processors 
(P) surrounded by AR memory processors (M) as shown for i? = 8. Compute processors use 
fast memory in form of a register set only. Each memory processor consists of a compute 
processor plus a memory interface with slower memory of larger capacity. 

In this article, we propose a new strategy for structuring parallel programs for tiled 
architectures, together with a resulting class of decoupled systolic algorithms that we call 
stream algorithms. Stream algorithms allow tiled architectures to operate in a systolic 
fashion for many regular problems with variable problem sizes. Stream algorithms require 



augmenting a tiled compute array with a set of memory processors on the periphery of the 
array, and potentially off-chip, as illustrated in Figure 1. A stream algorithm solves a large 
problem by breaking it into smaller, systolic subproblems, and by storing input data and 
intermediate results in the peripheral memories. The input data are supplied to the compute 
processors from the memories in a continuous stream^ via the network, thereby eliminating 
load/store memory accesses on the compute processors. Thus, stream algorithms decouple 
memory accesses from computation, and move the memory accesses off the critical path. 

Stream algorithms combine the benefits of decoupling^ the software version of the decou- 
pled access/execute architecture [12], and systolic algorithms^ the software version of systolic 
arrays [5]. Together, both features constitute an excellent match for the physical and tech- 
nological constraints of future tiled microarchitectures. There is one catch, though. Since 
stream algorithms are structured in systolic phases, they do not achieve high compute effi- 
ciency unless the subproblems can be pipelined. Decoupling introduces yet another source 
of inefficiency, namely the memory processors that execute memory accesses rather than 
contributing useful computation. We show how stream algorithms amortize the loss of effi- 
ciency by using a large number of processors with an asymptotically insignificant amount of 
memory processors. We also identify a kernel of architectural features needed to implement 
a decoupled systolic microarchitecture for executing stream algorithms area efficiently. 

The remainder of this paper is organized as follows. In Section 2 we introduce our 
decoupled systolic architecture. Section 3 defines our notion of stream algorithms. Then, we 
discuss five applications, a matrix multiplication in Section 4, a triangular solver in Section 5, 
an LU factorization in Section 6, a QR factorization in Section 7, and a convolution in 
Section 8. We show how to formulate these applications as stream algorithms, and argue 
why the resulting algorithms achieve optimal compute efficiency of 100 % asymptotically 
when executed on our decoupled systolic architecture. 

2 A Decoupled Systolic Architecture 

A decoupled systolic architecture (DSA) is a type of single-chip tiled architecture. We 
assume a fast network consisting of short wires connecting processors in a mesh topology 
as shown in Figure 1. Our DSA consists of an i? x i? array of compute processors^ and 
AR memory processors on the periphery of the compute array. The peripheral memory 
processors are the distinguishing feature of DSA's. Each of the memory processors consists of 
a compute processor with additional local memory. We assume that the memory can deliver a 
throughput of one load or store per clock cycle. The compute processors can be implemented 
in one of many architectural styles with varying degrees of efficiency, for example, VLIW, 
TTA, or superscalar. However, the choices for achieving 100 % compute efficiency in an 
area-efficient fashion are more limited. This section focuses on the key architectural features 
for a DSA without dwelling on the details of a particular instantiation. 

The compute processor, shown in Figure 2, is a simple general-purpose programmable 



^Our notion of a data stream is consistent with the colloquial sense. According to Webster [9], a stream 
is "an unbroken flow (as of gas or particles of matter)," a "steady succession (as of words or events)," a 
"constantly renewed supply," or "a continuous moving procession (a stream of traffic)." 
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Figure 2: A compute processor contains a general-purpose register set (GPR), an integer unit 
(lU), and a floating-point unit (FPU) based on a multiply-and-add module. The processor 
connects via FIFO's to its four neighbors. 



core comprising an integer unit, a floating-point unit with a multiply-and-add module as 
the centerpiece, and a multi-ported, general-purpose register set. We do not include a larger 
local memory because of the intrinsic physical constraint that large memories have larger 
latencies than small ones. Instead, we use a register set with relatively small but fast memory. 
To focus our attention on the datapath. Figure 2 omits all of the control logic and a small 
instruction memory. We assume a single-issue, in-order pipelined FPU that allows us to 
issue one multiply-and-add operation per clock cycle. 

Arguably the most important feature of a DSA is the design of its on-chip network. 
Our interconnect uses two prominent features of Raw's static network [13]. The network 
is register-mapped, that is instructions access the network via register names, and it is 
a programmed routing network permitting any globally orchestrated communication 
pattern on the network topology. The latter is important for stream algorithms that change 
patterns between the phases of the computation. We discuss these features in more detail 
in the following paragraphs. 

As illustrated in Figure 2, we use blocking FIFO's to connect and synchronize neighbor- 
ing processors. These FIFO's are exposed to the programmer by mapping them into register 
names in the instruction set. The outgoing ports are mapped to write-only registers with 
the semantics of a FIFO-push operation, and the incoming ports as read-only registers with 
the semantics of a FIFO-pop operation. Furthermore, we prefer the network to be tightly 
integrated with the pipelined functional units. Accordingly, bypass wires that commonly 
feed signals back to the operand registers also connect the individual pipeline stages to the 
outgoing network FIFO's.^ The tight network integration ensures low-latency communica- 
tion between neighboring compute processors, and allows for eflBcient pipelining of results 
from operations with diflFerent pipeline depths through the processor array. 

Our decoupled systolic architecture uses a wide instruction word to schedule multiple, 
simultaneous data movements across the network, between the functional units and the 



^The tight integration with the processor pipeline is a key design aspect of the Raw architecture [13], 
which earlier register-mapped network architectures including the Connection Machine CM2 [3] and Warp [1] 
lack. 



network, as well as between the register set and the network. A typical DSA instruction 
such as 

fma $4,$4,$N1,$W2 route $N1->$S1, $W2->$E2 

consists of two parts. The fma operation is a floating-point multiply-and-add compound 
instruction. It multiplies the values arriving on network ports Nl and W2, and adds the 
product to the value in general-purpose register $4. Simultaneously, it routes the incoming 
values to the neighboring processors as specifled by the route part of the instruction. The 
value arriving at port Nl is routed to outgoing port SI, and the value arriving at port W2 to 
outgoing port E2. Instructions of our decoupled systolic architecture block until all network 
operands are available. Using small FIFO's with a length larger than one eases the problem 
of scheduling instructions substantially. There exists a trade-oflF between the instruction 
width and the area occupied by the corresponding wires within a processor. For our DSA, 
we assume that three data movements can be specifled within the route part of a single 
instruction. 

DSA's may be implemented as a virtual machine on top of a tiled architecture. We 
have experimented with this idea on Raw, which can be viewed as an architectural superset 
of our DSA. Since every Raw processor has local memory and can be viewed as a memory 
processor, we simulate our compute processors simply by ignoring the available local memory. 
In that respect. Raw is a functional, but not area-eflBcient DSA. Raw has separate switch 
and compute processors on each tile, each with their own instruction stream. We splice the 
instruction stream of each of our DSA processors into a Raw processor stream, a Raw switch 
stream, plus synchronization primitives. Another diflFerence is that Raw's floating-point unit 
is not based on a multiply-and-add module. Thus, the fma instruction of our DSA requires 
two instructions on Raw. 

3 Stream Algorithms 

In this section we introduce decoupled systolic algorithms, nicknamed stream algorithms, 
and a set of conditions for which we can increase eflBciency by increasing the number of 
processors such that the compute eflBciency approaches 100%. Alternatively, we may view 
stream algorithms as the product of a program-structuring methodology. We identify flve 
design principles for stream algorithms: 

1. We reduce the instruction count on the critical path of a computation by 
abandoning load and store instructions on the compute processors. 

2. We use systolic designs^ because they are well suited for parallel machines with local 
interconnect structure and match our architecture with fast but short wires. 

3. We decouple memory accesses from computation by dedicating processors to 
one of the two tasks. This idea is motivated by the Decoupled Access/Execute Archi- 
tecture [12]. 



4. We use M memory processors and P compute processors, such that the number of 
memory processors M is asymptotically smaller than the number of compute processors 
P, that is M=o(P). 

5. We partition problems into decoupled systolic algorithms, and use the M memory 
processors for temporary storage. 

The key strategy for the design of an efficient decoupled systolic algorithm is to recog- 
nize that the number of memory processors must be negligible compared to the number of 
compute processors, because memory processors do not contribute any useful computation. 
While it is often impossible to design an efficient decoupled systolic algorithm for a very 
small number of processors and a very small problem, we can actually increase the efficiency 
for larger numbers of processors and large problems. We emphasize this observation by 
formulating the decoupling-efficiency condition. 

Definition 1 (Decoupling- Efficiency Condition) 

Given a decoupled algorithm with problem size N and a network of size R,^ let P{R) be the 
number of compute processors and M{R) the number of memory processors. We say the 
algorithm is decoupling efficient if and only if 

M{R) = o{P{R)), 

Informally, decoupling efficiency expresses that the number of memory processors be- 
comes insignificant relative to the number of compute processors as we increase the network 
size R. Decoupling efficiency is a necessary condition to amortize the lack of useful computa- 
tion performed by the memory processors. For example, suppose we implement an algorithm 
on P — R^ compute processors. If we can arrange the memory processors such that their 
number becomes negligible compared to P when increasing the network size i?, the resulting 
algorithm is decoupling efficient. Thus, for a decoupling-efficient algorithm with P — 9(i?^), 
we may choose M to be 6(lgi?), or M = 6(i?), or M = 6(i?lgi?). In contrast, a design with 
M — Q[R^) would not be efficiently decoupled. Decoupled systolic algorithms per se are 
independent of a particular architecture. Note, however, that the DSA shown in Figure 1 is 
particularly well suited for executing either one such algorithm with (P, M) — (9(i?^), B(i?)) 
or multiple algorithms concurrently with (P, M) — (B(i?), B(l)). 

Decoupling efficiency is a necessary but not sufficient condition to guarantee high per- 
formance. We determine the compute efficiency of a stream algorithm with problem size N 
on a network of size R from the number of useful compute operations C{N)^ the number of 
time steps r(A^, i?), and the area counted in number of processors P{R) + M{R): 

^^^' ^^ " T{N,R)^{P{R) + M{R))' ^^^ 

The product of time steps and area can be interpreted as the compute capacity of the DSA 
during time period T. For all practical purposes, we may relate the problem size N and 



^We use the network size i? as a canonical network parameter. The number of processing nodes is 
determined by the network topology. For example, a 1-dimensional network of size R contains R processing 
nodes, whereas a 2-dimensional mesh network contains R'^ processing nodes. 



network size R via a real- valued a such that N = aR. Substituting aR for N in Equation 1, 
we define compute eflBciency by means for the following condition. 

Definition 2 (Compute-Efficiency Condition) 

We call an algorithm with problem size N compute efficient when executed on a network 
of size R, if and only if 

lim E{g,R) = 1, 

where N — aR. 

Equation 1 implies a necessary condition for obtaining a compute-eflBcient algorithm: 
either the number of memory processors M = or the algorithm is decoupling eflBcient. If 
we operate a compute array without any memory processors it is a systolic array. We are 
interested in the case where M > 0, however, and compute eflBciency implies decoupling 
eflBciency as a prerequisite. Thus, with decoupling eflBciency as necessary condition for 
achieving 100 % compute eflBciency asymptotically, every compute-eflBcient stream algorithm 
is decoupling eflBcient, whereas the converse is not true. The compute-eflBciency condition 
requires that both R ^ oo and a = N/R -^ oc. Thus, in practice we require that N ^ R^ 
which we view as a realistic assumption since using a very large network implies that we 
intend to solve a very large problem. For cr = 1, the problem size matches the network size, 
and we operate the network as a systolic array. Since decoupling-eflBcient stream algorithms 
use an asymptotically smaller number of memory processors than compute processors, we 
may view stream algorithms as a subset of systolic algorithms with a restricted number of 
inputs and outputs. Inversely, we may view a systolic algorithm as a special case of a stream 
algorithm that is distinguished by a = 1 in A^ = aR. We discuss the trade-oflF between N 
and R during our discussion of stream algorithms below. 

Before presenting concrete examples of stream algorithms, we outline our general stream- 
structuring methodology^ which consists of three steps: 

Partitioning: Given a problem with a > 1 in A^ = aR^ that is the problem size N is larger 
than the network size i?, we start by partitioning the problem into smaller, independent 
subproblems. Each of the subproblems as well as the composition of their results must 
be suitable for parallelization by means of a systolic algorithm such that the compute 
processors access data in registers and on the network only. For simple data-parallel 
applications, the partitioning can be obvious immediately. For applications with more 
complicated data dependencies, we find that recursive formulations and partitioning 
methods like those developed for out-of-core algorithms [14] can be helpful. To simplify 
the design of the systolic algorithm, retiming [7] may be used. It allows us to start 
the design with a semi-systolic algorithm, which we can transform automatically into 
a systolic algorithm if one exists [6]. The design of a semi-systolic algorithm can be 
significantly easier than that of a systolic version, because it permits the use of long 
wires that extend beyond next-neighbor processors. 

Decoupling: Our goal is to move the memory accesses oflF the critical path. To this end, 
we have to decouple the computation such that the memory accesses occur on the 
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memory processors and compute operations on the compute processors. For a systolic 
problem, the memory processors feed the input streams into the compute processors, 
and the decoupling procedure is almost trivial. However, the composition of several 
subproblems requires careful planning of the flow of intermediate data streams, such 
that the output streams of one systolic phase can become input streams of a subsequent 
phase without copying streams across memory processors. Occasionally, it may be 
beneflcial to relax the strict dedication of memory processors to memory accesses, 
and compute portions of the composition of the subproblems, such as reductions, on 
the memory processors themselves. Therefore we integrate a fully-fledged compute 
processor into the memory processor of our stream architecture. 

Efficiency Analysis: After partitioning and decoupling, we have designed a stream algo- 
rithm. To qualify as a compute-eflBcient stream algorithm, however, we also require 
that the compute-eflBciency condition holds. Therefore, the choice of the number of 
memory processors must be asymptotically smaller than the number of compute pro- 
cessors, and we must show that E{a, R) approaches 1 for large values of R. Meeting 
the compute-eflBciency condition requires that we order the subproblems for optimal 
pipelining on the compute array. Experience shows that we may need to iterate over 
the partitioning and decoupling steps until a compute-eflBcient solution is found. 

Let us emphasize the concept of a stream algorithm by highlighting what a stream al- 
gorithm is not. (1) A stream algorithm is not a collection of N tasks that is scheduled on 
R < N processors, using time sharing and context switching to guarantee progress. Instead, 
a stream algorithm is a computation structured such that the schedule of individual tasks 
is determined by the order of elements in the data streams that is primarily organized by 
the memory processors. Also, (2) a stream algorithm does not simulate \N/R'] = [a] pro- 
cessors of a systolic array on one compute processor. While simulation of a systolic array 
is a generally applicable method for executing a parallel algorithm of problem size N on 
R < N processors, each of the R processors needs an unbounded amount of memory to store 
the state of each of the [a] subproblems, and consequently additional instructions must be 
executed to manage a large local memory. In contrast, stream algorithms avoid local mem- 
ory accesses entirely by decoupling the computation from memory accesses, and moving the 
memory accesses oflF the critical path. 

4 Matrix Multiplication 

As our flrst example of a stream algorithm, we consider a dense matrix multiplication. Given 
two N X N matrices A and 5, we wish to compute the N x N matrix C = AB. We compute 
element Cij in row i and column j of product matrix C as the inner product of row i of A 
and column j of B: 



N 



a 



V 



z2 ^^^ ' ^^i' (2) 



k=i 
where 1 < i,i < A. 



Partitioning 

We use a block-recursive partitioning for the matrix multiplication. We recurse along the 
rows of A and the columns of B: 



Cii Ci2 

C21 C22 



- (t)(^"^'0 



(3) 



For each of the matrices Cij we have Cij — AnBij^ where An is an N/2 x N matrix and Bij 
an A^ X N/2 matrix. Thus, the matrix multiplication can be partitioned into a homogeneous 
set of subproblems. 

Decoupling 

We begin by observing that each product element Cij can be computed independently of all 
others by means of Equation 2. In addition, Equation 3 allows us to stream entire rows 
of A and entire columns of B through the compute processors. Furthermore, we partition 
a problem of size N x N until the Cij are of size R x R and fit into our array of compute 
processors. We implement the resulting subproblems as systolic matrix multiplications, 
illustrated in Figure 3 for A^ = i? = 2. Rows of A flow from the left to the right, and 
columns of B from the top to the bottom of the array. 
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Figure 3: Seven time steps of a systolic matrix multiplication C = A - B for 2 x 2 matrices. 
Each box represents a compute processor. Values entering, leaving, or being generated in 
the array are shown in bold face. Shaded boxes mark the completion of an inner product. 
We split the data flow of the operands and products into the top and bottom rows. 



For N > R^ the compute processor in row r and column s computes the product elements 
Cij for all i mod R — r and j mod R — s. To supply the compute processors with the proper 
data streams, we use R memory processors to store the rows of A and R additional memory 
processors to store the columns of B. Thus, for the matrix multiplication, we use P — R^ 
compute processors and M — 2R memory processors. Figure 4 illustrates the data flow of 
a decoupled systolic matrix multiplication for A^ = 4 and R — 2. Note how the memory 
processors on the periphery determine the schedule of the computations by streaming four 
combinations of rows oi A and columns oi B into the compute processors. First, we compute 
Cii by streaming {A(l,:), A(2,:)} and {5(:,1), 5(:,2)} through the array. Second, we 
stream {^(1,:), A(2,:)} against {5(:,3), 5(:,4)}, third, {A(3, :), A(4, :)} against {5(:,1), 
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Figure 4: Data flow of a compute-eflBcient matrix multiplication C = A • 5 for 4 x 4 matrices 
on 2 X 2 compute processors. Shaded boxes on the periphery mark memory processors, and 
indicate the completion of an inner-product otherwise. 

5(:, 2)}, and flnally {^(3, :), A(4, :)} against {5(:, 3), 5(:, 4)}. As a result, we compute Cn, 
C12, C'21, and C22 in that order. 

If product matrix C cannot be streamed into a neighboring array of consuming compute 
processors or oflF the chip altogether, but shall be stored in memory processors, we may 
have to invest another R memory processors for a total of M = 3i?. In any case, we have 
P — 9(i?^) and M — 0(i?), and hence M — o{P). We conclude that the structure of our 
matrix multiplication is decoupling eflBcient. 

Note that we could use a similar organization to compute a matrix-vector product Ax^ 
where A is Slu N x N matrix and x an A^ x 1 vector. However, using only one column 
of i? X 1 compute processors requires M = i? + 1 memory processors. Since M ^ o{P)^ 
this organization is not decoupling eflBcient. However, there exists a diflFerent design that 
is decoupling eflBcient by storing matrix A and vector x on one memory processor and by 
distributing the inner products across a linear array of compute processors. 

Efficiency Analysis 

The number of multiply-and-add operations in the multiplication of two N x N matrices is 
C{N) = A^^. On a network of size R with P — R^ compute processors and M — 2R memory 
processors, we pipeline the computation of {N/ RY systolic matrix multiplications of size 
R X N times N x R. Since this pipelining produces optimal processor utilization, and the 
startup and drain phases combined take 3i? time steps (cf. Figure 4), the total number of 
time steps required by this computation is 

Tmm{N,R) = {N/RfR + 'iR. 
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According to Equation 1, the floating-point efficiency of our matrix multiplication is therefore 

Err,m{N,R) = ((^/^)3^ + 3^) . (^2 + 2i?) ' 

Using a = N/R instead of parameter N^ we obtain 

for the efficiency. Consider each of the two product terms independently. Term a^ /{a^ + 3) 
approaches 1 for large values of a, that is if the problem size N is much larger than the 
network size R. On the other hand, term R/{R + 2) approaches 1 for large network sizes R. 
If we assume a constant value a ^ 1, we find that the efficiency of the matrix multiplication 
increases as we increase the network size, and approaches the optimal fioating-point efficiency 
of 100 % asymptotically. We also note that for a fixed a, the stream matrix multiplication 
requires T{N) = (cr^ + 3/a)N = 9(A^) time steps on a network with [N/aY compute 
processors. 

In practice, the network size R is subject of a delicate trade-oflF. To maximize efficiency, 
we want to maximize both terms in Equation 4. Thus, given a problem size N ^ to increase 
the first term, we want to increase a — N/R and, hence, decrease R. On the other hand, 
to maximize the second term, we want to increase R. To determine a good value R for 
implementing a DSA, let us consider some absolute numbers. For example, if A^ = i?, that 
is cr = 1, we have a systolic matrix multiplication with 

1 /? 

Thus, the maximum efficiency is just 25 % even for an infinitely large network. On the other 
hand, for a relatively small value a = 8, we have 

Emm{^ = 8, R) =0.99 



R + 2 

Hence, for a network size of i? = 16, a compute-efficient matrix multiplication of problem size 
A^ = 8 • 16 = 128 achieves almost 90% efficiency. Larger problem sizes and larger networks 
operate above 90 % efficiency. 

5 Triangular Solver 

A triangular solver computes the solution x of a linear system of equations Ax = b assuming 
that matrix A is triangular. Here is an example with a 4 x 4 lower-triangular matrix A. 
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Finding solution a; is a straightforward computation known as forward substitution: 

hi 

Xi — — 
an 

I I '-' \ 

Xi = — \h-^(^ijXj\ for i = 2,3, ...,7V. 

We are interested in triangular solvers as building blocks of other algorithms including an 
LU factorization. In particular, we are interested in the lower-triangular version that finds 
an A^ X A^ matrix X as the solution of AX = 5, where 5 is an A^ x A^ matrix representing A^ 
right-hand sides. 

Partitioning 

We partition the lower-triangular system of linear equations with multiple right-hand sides 
recursively according to Equation 5. Matrices An and A22 are lower triangular. 

Xn X12 \ _ f Bii B12 \ / X 

X21 X22 ) \ B21 B22 ) 

The partitioned triangular form leads to a series of smaller problems for the lower-triangular 
solver: 

(6) 
(7) 
(8) 
(9) 
(10) 

(11) 

First, we compute the solution of the lower-triangular systems in Equations 6 and 7, yielding 
Xii and X12. We use these solutions subsequently to update matrices B21 and ^22 in 
Equations 8 and 9, producing B21 and B22' We could compute the matrix subtraction 
in Equations 8 and 9 on the compute processors of the array. However, we can save the 
associated data movement by executing the subtraction on the memory processors. This 
alternative is simpler to program as well. Matrices B21 and B22 are the right-hand sides 
of the lower-triangular systems in Equations 10 and 11. Solving these systems yields X21 
and X22. Thus, Equations 6-11 define a recursive algorithm for solving the lower-triangular 
system of Equation 5. The recursion reduces the problem of solving a lower-triangular system 
of linear equations into four smaller lower-triangular systems of linear equations, plus two 
matrix multiplications that we have discussed in Section 4 already. 

Decoupling 

To arrive at a decoupled design, we observe that the computations for the individual right- 
hand sides of the linear system AX — B are independent. Consider the following system for 
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A^ = 3 and two right-hand sides. 

&11 &12 
^31 ^32 

The computation of column j of X depends on the elements of A and the elements of 
column i of B only, which means that computations of columns of X may be performed 
independently. 

Figure 5 depicts the systolic algorithm for our lower-triangular solver. We stream rows 
of A from the left to the right and columns of B from the top to the bottom of the compute 
array, while columns of X stream from the bottom of the array. The processor ^^ in row i 
and column j of the compute array is responsible for computing element Xiy Note that 
due to the independence of columns in this computation we may permute the columns of B 
arbitrarily, provided we preserve the staggered data movement. We can also use the systolic 
design of Figure 5 for an upper-triangular solver by reversing the order in which the rows 
of A are stored on the memory processors, and by reversing the order in which the elements 
of the columns of B are fed into the compute processors. 

We illustrate the systolic algorithm by describing the computation of element x^x — 
(631 — az\X\\ — a32X2i)/a33. We begin with time step 4 in Figure 5. Processor ^31 receives 
element X\\ from ^21 above and a3i from the left, and computes the intermediate result s — 
a3i • X\\. At time step 5, processor ^31 receives element X21 from above and a32 from the left. 
Executing a multiply-and-add operation, ^31 computes intermediate result t = s + a32 • X21. 
At time step 6, processor ^31 receives a33 from the left and 631 from ^21 above, and computes 
^31 = (^31 — ^)/<^33- During the next time step 7, element x^x is available at the bottom of 
the array. 

When reducing a problem of size NxN recursively until the subproblems fit into an i? x i? 
array of compute processors, we need 3R memory processors on the periphery of the compute 
array to buffer matrices A, 5, and X. Figure 6 shows the computation of Xu and X21 by 
means of Equations 6, 8, and 10. As implied by this figure, we use R memory processors 
to store the rows of A, and R memory processors for the columns of B and X, respectively. 
Thus, for a decoupled systolic lower-triangular solver, we require P — R^ compute processors 
and M — 3R memory processors, meeting our decoupling-efficiency condition M = o{P). 

Unlike the matrix multiplication, the factorization of the triangular solver does not pro- 
duce identical subproblems. Therefore, we are faced with the additional challenge of finding 
an efficient composition of these subproblems. Although we can pipeline the subproblems, we 
cannot avoid idle cycles due to data dependencies and the heterogeneity of the computations 
in Equations 6-11. However, we can minimize the loss of cycles by grouping independent 
computations of the same type, and pipelining those before switching to another group. For 
example, we can group and pipeline the computations of Equations 6 and 7, then Equations 8 
and 9, and finally Equations 10 and 11. If we unfold the recursion all the way to the base 
case of R X R subproblems, we find that the best schedule is equivalent to a block-iterative 
ordering of the subproblems. 
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Figure 5: Systolic lower-triangular solver for A^ = 3 and two right-hand sides. 



Efficiency Analysis 

We compute the efficiency of our lower-triangular solver according to Equation 1. The num- 
ber of floating-point multiply- and- add operations is C{N) = A^^/2, counting each division 
as a multiply-and-add operation. 

As mentioned above, the crux for an efficient schedule is to order the computations in 
Equations 6-11 such that two subsequent systolic algorithms can be overlapped. For the 
matrix multiplication, flnding a perfect overlap is relatively easy, because the there is only 
one systolic algorithm. For the lower-triangular solver, we illustrate the search for a good 
schedule and the corresponding efficiency analysis by means of the example in Equation 12, 
where matrices A, A, and B consist of a x a blocks, each block is of dimension R x R^ and 
a = 4. 
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Figure 6: Phases of a decoupled systolic lower-triangular solver on an i? x i? array of compute 
processors. In phase 1 we solve Equation 6 for Xn. In phase 2 we update B21 according 
to Equation 8. While the matrix multiplication is executed on the compute processors, the 
matrix subtraction is performed on the memory processors as they receive each individual 
result of A21X11. Finally, in phase 3 we solve Equation 10 for X21. The shapes of the matrix 
areas indicate how the rows and columns enter the compute array in a staggered fashion. 

Recall that the computations of the individual columns and, thus, column blocks of X 
are independent. Therefore, we may sequence the systolic computations across rows to 
maximize overlap. In the 2x2 partitioning of Figure 6, the computation of a column block 
of X consists of two solver computations and one update operation. The interleaving of two 
such computations yields the sequence of Equations 6-11. Now, consider column block i of 
the 4x4 example in Equation 12 with the following sequence of operations. First, solve 

B'n for 



AiiXii — Bii for Xii. Second, update B^^ = B2i — ^21X1^. Third, solve A22X 



X2i. Fourth, update B'^- = B^i — ^31X1^ — ^23X2^, which involves two update operations. 
Fifth, solve ^33X3^ = B'^- for X3^ Sixth, update B'^- = B^i - A^iXu - ^42X2^ - ^43X3^, 
involving three update operations. Finally, solve ^44X4^ = B'^- for X4^. We observe that 
for increasing a, the number of solvers increases quadratically while the number of update 
operations increases cubically. 

Since the update operations are little more than matrix multiplications, we may pipeline 
and overlap as many of them as possible, resembling our stream-structured matrix multi- 
plication. Thus, we use a block iterative schedule that iterates over row blocks. For each 
row block i, we schedule the solver computations of an entire row i with maximal overlap. 
There are a solvers for each row block, which require aR time steps plus 2R time steps for 
starting and draining the pipeline. Then, we compute and overlap all update operations 
associated with Xij. For each X^j, there are a — i update operations, resulting in a{a — i) 
update operations associated with row i. These operations require g{g — i)R time steps 
plus 3i? time steps to start and drain the pipeline. We may save another 2R time steps by 
overlapping the first update operation with the last solver computation and the first solver 
computation of the next row block with the last update operation of the previous row block. 
The number of time steps for an X x X lower-triangular solver with a x a blocks of size 
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X i? is then: 




Tits{(T.R) -- 


i=l i=l 


- 


= ^{a' + a' + 6a-2). 



i)R + 3R)- ^2i? 



We find that, for a fixed a, the total number of time steps is r(A^) = 6(A^) when using 
{N/aY compute processors. 

According to Equation 1, the fioating-point eflBciency of our compute-eflBcient lower- 
triangular solver is then 

^'..(<'.«) - „,^;l,,.2 j^,- (13) 

Analogous to Equation 4 for the matrix multiplication, the eflBciency is the product of two 
terms, one depending indirectly on the problem size N via a, and the second depending on 
the network size R. For a = 1, the problem reduces to a single systolic lower-triangular 
solver, and we obtain an eflBciency of 

The eflBciency increases when we increase R and a, such that the fioating-point eflBciency 
approaches the optimal value of 100%. Since the solver requires memory processors along 
three sides of the array of compute processors, the second term requires a slightly larger 
network size to achieve high eflBciency. For example, for a very large a, we have Eits{R) ~ 
R/{R + 3), and we achieve more than 90 % eflBciency for R > 27. 

6 LU Factorization 

We wish to factor an A^ x A^ matrix A into two N x N matrices L and [/, such that L is 
lower-triangular, U is upper-triangular, and A = LU. Furthermore, for all diagonal elements 
of L = (lij) we require that /^^ = 1, where 1 < i < A^. 

Partitioning 

We partition the LU factorization according to Equation 14. Matrices Lu and L22 are lower 
triangular, while matrices Uu and U22 are upper triangular. 

/ An Au\ ^ ( Ln \( Un 

V ^21 ^22 / V ^21 i^22 ; V 

This partitioning results in a series of smaller problems: 

An = LnUn (15) 

Ai2 = LnUu (16) 

A21 = L2iUn (17) 

A^2 = A22-L21UU (18) 

A'^2 = L22U22 (19) 
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First, we compute Ln and Un by factoring An according to Equation 15. We use the 
results to solve Equations 16 and 17 for U12 and L21 respectively. Then, we update A22 
according to Equation 18, which produces A'22' As with the triangular solver, the matrix 
subtraction required by this step can be performed on the memory tiles. Finally, we use 
A'22 to compute L22 and U22' The recursive formulation due to Equations 15-19 reduces the 
problem of an LU factorization into two smaller LU factorizations, a matrix multiplication, 
a lower-triangular solver, and an upper-triangular solver of the form XU — B. The stream- 
structured version of this upper-triangular solver is similar to that of the lower-triangular 
solver. 

Decoupling 

We begin the derivation of a decoupled design with a systolic LU factorization. As an 
example, consider the LU factorization for A^ = 3. 

an ai2 ^13 

^21 ^22 ^23 
^31 ^32 ^33 

Figure 7 shows the progress of our systolic LU factorization. Columns of matrix A 
enter the compute array at the top, and fold over towards the bottom. The columns of 
upper-triangular matrix U leave the array at the bottom and the rows of lower-triangular 
matrix L on the right. The compute processor pij in row i and column j of the compute 
array computes either Uij if i < j, or lij otherwise. Since /^^ = 1, the diagonal elements of L 
are neither computed nor stored explicitly. 

The data flow pattern of the LU factorization is straightforward. Elements of L stream 
from left to right, and elements of U stream from top to bottom. When a pair of elements 
enters processor pij^ it computes an intermediate value of either lij or Uij. As a concrete 
example, we discuss the computation of element U22 = CL22 — hi * '^12, where Uu = ai2, 
^11 = CLii^ and /21 = CL2i/uii. Processor P22 at the center of the array will produce value 1^22- 
We begin at time step 6 of Figure 7. Processor p2i receives Uu from above and uses it to 
compute element /21, which is sent to the right and becomes available on processor ^22 at 
time step 7. Simultaneously, processor pi2 sends u^ = a^ downwards towards processor P22- 
At time step 7, processor ^22 receives elements u^ from pi2 above and /21 from p2i on the 
left. With element a22 already resident since time step 5, processor ^22 computes U22 = 
<^22 — hi ' '^12- Value U22 remains on processor ^22 during time step 7, while value Uu is sent 
towards neighbor ^32. Then, during time step 8, u^ is sent to neighbor ^32. At time step 9 
processor ^32 uses U22 to compute /32 = (a32 — /31 • t^i2)/'^22- In time step 10, element U22 
leaves the array at the bottom of processor ^32. 

Analogous to the triangular solver, we reduce a problem of size N x N recursively un- 
til the subproblems flt into an i? x i? array of compute processors. Figure 8 illustrates 
the data movement of the matrices when computing flve systolic subproblems according to 
Equations 15-19. We need R memory processors to buflFer the columns of A, another R for 
the rows of L, and an additional R for the columns of U. Thus, our decoupled systolic LU 
factorization requires P — R^ compute processors and M — 3R memory processors. We 
observe that M = o(P), and the structure of our LU factorization is decoupling eflBcient. 
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Figure 7: Systolic LU factorization for A^ = 3. 
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Similar to our lower-triangular solver, the partitioning due to Equations 15-19 produces 
a set of heterogeneous subproblems. To obtain the most efficient composition of the subprob- 
lems, we group, pipeline, and overlap independent subproblems. Unfolding the recursion all 
the way to subproblems of size R permits a block-iterative schedule with efficient pipelining. 
Analogous to the standard three-fold loop of the LU factorization, we can solve Equation 15 
for each pivot block, pipeline the solvers of Equations 16 and 17 for the pivot row and column, 
and update the lower right matrix by pipelining the matrix multiplications of Equation 18. 
The computation of Equation 19 corresponds to the factorization of the next pivot block. 

Efficiency Analysis 

We approximate the number of multiply-and-add operations of an A^ x A^ LU factorization 
by C{N) ^ A^^/3. This approximation neglects an asymptotically insignificant linear term, 
if we count divisions as multiply-and-add operations. 

To find an efficient schedule for the LU factorization, we apply the methodology that we 
introduced for the lower-triangular solver. We unfold the recursive partitioning to the leaves, 
where each subproblem has size Rx R^ and group the individual systolic computations so as 
to maximize their overlap. Our block-iterative schedule resembles the standard 3-fold loop 
of the LU factorization. For each pivot block A;, we factor A^^ into L^^ ctnd Ukk^ ripply lower 
and upper triangular solvers to compute the Li^ and Ukj^ and update the lower right matrix. 
The systolic LU factorization requires AR time steps. The systolic solvers for computing 
the (cr — k) matrices Ukj use {a — k)R time steps plus 2R time steps to start and drain 
the pipeline. The computation of the {a — k) matrices Li^ also requires [a — k)R time 
steps, but 3i? time steps to start and drain the pipeline, because matrices A and U enter 
the array from the top and the bottom, cf. Figure 8(3), rather than the top and the right, 
cf. Figure 8(2). Updating the {a — k)x[a — k) lower right matrix blocks uses (a — k^R time 
steps plus 3i? time steps for starting and draining the pipeline. We may save a few time 
steps by overlapping the first and last operations of these phases as follows. We can overlap 
the first lower-triangular solver with the preceding LU factorization by R time steps, the 
first upper-triangular with the last preceding lower-triangular solver by 2R time steps, the 
first update operation and the last preceding upper-triangular solver by R time steps, and 
the LU factorization succeeding the last update operation by 2R time steps. The number of 
time steps of an A^ x A^ LU factorization with a x a blocks of size R x R is then: 

k=i 

a—l a—1 a—1 

+ E ((^ - ^)^ + 2i?) + E ((^ - k)R + 3R) + Y^ ((a - kfR + 3i?) 

k=l k=l k=l 

a-1 

k=l 

We observe that for a fixed a, the total number of time steps is r(A^) = 6(A^) when using 
{N/aY compute processors. 
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Figure 8: Phases of stream-structured LU factorization on an i? x i? array of compute 
processors, with N = 2R. In phase 1 we factor An into Lu and Uu according to Equation 15. 
In phase 2 we solve Equation 16 for Uu- In phase 3 we solve Equation 17 for L21. In phase 4 
we compute A22 according to Equation 18, performing the matrix subtraction on the memory 
tiles. Finally, in phase 5 we factor A22 according to Equation 19. The shapes of the matrices 
indicate how the rows and columns enter the compute array in a staggered fashion. 
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Using a network of size R^ P — R^ compute processors, and M — 3R memory processors, 
the floating-point eflBciency of our LU factorization according to Equation 1 is 



Eiu{(T,R) 



a 



R 



a' 



3 + la 



2 + ^a 



6 R + 3 



Analogous to our treatment of Emm ctnd Eus in Equations 4 and 13, we split Ei^ into two 

terms, one of which depends on a = N/R^ and one depending on the network size R. For 

cr = 1, the problem reduces to a single systolic LU factorization, and we obtain a compute 

eflBciency of 

1 /? 
EiJ(J = l,i?) = . 

Asymptotically, that is for large R and a, the compute eflBciency of our LU factorization on 
our DSA approaches the optimal value of 100%. As for the triangular-solver, when a ^ 1 
we have Eiu{R) ~ i?/(i? + 3), and we achieve more than 90 % eflBciency for R > 27. 

7 QR Factorization 

We want to factor an M x A^ matrix A with M > N into two matrices Q and i?, such that Q 
is an orthogonal M x M matrix, R is an upper triangular M x N matrix, and A = QR. 
We assume that the factorization is based on fast Givens rotations [2]. The application of a 
fast Givens rotation, expressed as a matrix multiplication, annihilates an element of A. We 
call the M X M matrix G{i,j) a fast Givens transformation, if it places value zero in 
element (i, j) of the product G{i,j)^ • A, and restrict G{i,j) = (gij) to the form 
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All diagonal elements of G{i,j) have value 1 and all oflF-diagonal elements have value 0, with 
the exception of elements gji and gij, which have non-zero values gji = a and gij = /3. Given 
matrix A = (a^j) and an auxiliary diagonal M x M matrix D — [di), a and /3 are determined 
as a = —aij/ajj, /3 = —adj/di, 7 = —a/3, di — {1 -\- j)di, and dj = (1 + j)dj. Matrix D 
allows us to compute a and /3 without square root operations. We initialize D such that 
d^ = 1 for 1 < i < M. 

We apply a series of fast Givens transformations to A to create an upper triangular 
matrix. Since transformation G{i,j) annihilates element a^j, we apply MN — N^ /2 — N/2 
fast Givens transformations with 1 < j < N and j < i < M io triangularize matrix A. Once 
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a particular sequence of fast Givens transformations is chosen, the computation of D and Q 
must observe this order. We pick a sequence of fast Givens transformations that generates 
an upper triangular matrix U by annihilating the elements below the diagonal column- wise 
from left to right, and within each column from top to bottom. More succinctly, we apply 
the sequence of fast Givens transformations 6^(2, 1), 6^(3, 1), . . ., G{M^ 1) to annihilate all 
elements below the diagonal in the first column, proceed by transforming the second column 
with G(3, 2), G(4, 2), . . ., G(M, 2), and so on up to column TV: 

G(M,A^)^...G(A^ + 1,A^)^...G(M,2)^...G(3,2)^G(M,1)^...G(2,1)^A = U. 

Since {ABY — B^A^^ we can write this product in compact form as^ 

N M 

G^A = U, where ^=11 11 G{iJ). 

Finally, let us discuss the role of diagonal matrix D briefly. By construction of the fast 
Givens transformation, we have G^G = D. Since D is diagonal, we may split D such that 
D = D^'^D^'^. Then we obtain D'^'^G^GD-^'^ = {GD-^'YiGD'^'^) = /, and notice 
that GD-^I^ is orthogonal. Thus, we may rewrite G^ A = [/ as {D-^I^G^)A = D'^I^U with 
the consequence that Q — GD~^/'^ is orthogonal and R = D'^^'^U is upper triangular. 

Partitioning 

We partition the QR factorization according to Equation 21 for M > A^. Without loss of 
generality, we restrict our discussion to the case where M = 3/2N, such that matrices Aij 
and Rij in Equation 21 are N/2 x N/2 matrices. 



(21) 



This partitioning allows us to formulate the QR factorization as a series of three sub- 
problems, deflned by Equations 22-24. 
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^We introduce the convention that the product notation defines a sequence of multiplications that corre- 
sponds to the sequence of indices in the product such that multiplications with larger indices are applied from 
the right, that is we interpret products as right-associative and the matrix multiplication as left-associative 
to determine the sequence uniquely. For example, 

2 3 2/3 \ 

n n G'(i,i)=n n G(z,i) =(g(2,i)-g(3,i))-g(3,2). 
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(23) 



(24) 

Equations 22-24 enable us to compute the QR factorization by solving three smaller 
problems. First, we triangularize columns 1, . . . ^N/2 of A according to Equation 22. As 
a result, we obtain Rn and a sequence of fast Givens transformations associated with ma- 
trix Qi. Next, we update columns N/2 + 1, . . . , A^ of A according to Equation 23, which 
yields R12 and the intermediate matrices A'22 and A32. This computation uses the sequence 
of fast Givens transformations calculated in the previous step, without computing Qi ex- 
plicitly. Then, we triangularize columns A^/2 + 1, . . . , A^ according to Equation 24, resulting 
in R22 and the sequence of fast Givens transformations associated with Q2. Matrix Q2 is 
an A^ X A^ submatrix of Q2. Matrices Q, Qi, and Q2 are defined in terms of fast Givens 
transformations by Equations 25-27: 

(NI2 M \ 

Qi = n n G{i,mD-"' (25) 



7 



N M 



Q2=\ ^ n = n n G{i,j)\b-"' (26) 

V ^' / \j=NI2+U=j+l ) 

(N M \ 

n n G{i,3)]b-"'D-"\ (27) 

where D — (di) and D = (di) are diagonal M x M matrices. Furthermore, di = di for 
< i < N/2 and di — I otherwise, and di — di for A^/2 + 1 < i < M and di — 1 otherwise. 

According to common practice, we do not compute the intermediate forms Qi and Q2 
of matrix Q in order to triangularize matrix A. Instead, we utilize the sparse structure of 
the fast Givens transformation to operate with a highly eflBcient representation. Recall from 
the definition oiG{i^j) in Equation 20 that G{i^j) contains only two characteristic values a 
and /3. Thus, it suflBces to represent G{i^j) by means of the pair {aij^ /3ij). In addition to 
being a space eflBcient representation, it is also advantageous for implementing the only two 
operations associated with fast Givens rotations, premultiplication and postmultiplication. 
We say that we premultiply a matrix A with fast Givens transformation G{i^j) when 
forming the product G{i^j)^ • A, and we postmultiply matrix A when forming the product 
A • G{i,j). Due to the structure of G{i,j), premultiplication eflFects rows i and j of A only, 
and postmultiplication changes the values in columns i and j of A only, cf. Figure 9. 

Premultiplication of an M x M matrix A = (a^j) produces elements a'jj. = ajk + Aj * (^ik 
in row j and a[f. = aik + aij • ajk in row i for 1 < A; < M. All other elements of A 
remain unchanged by premultiplication. Analogously, postmultiplication results in elements 
a'j^j = CLkj+l^ij-aki in column j and elements a'j^^ = aki+aij-a^j in column i for 1 < A; < M. All 
other elements of A are retained by postmultiplication. We note that both premultiplications 

23 



] I 



G(i,jy-A: 



A-G(iJ): 



Figure 9: The product of premultiplication G{i^j)^ • A differs from A in rows i and j only, 
while the product of postmultiplication A • G{i^j) differs from A in columns i and j only. 

and postmultiplications offer various degrees of parallelism. Specifically, we will exploit the 
fact that the premultiplications G{i^j)^ -A and G{k^ l)^ -A and postmultiplications A'G{i^j) 
and A • G{k^ I) are independent for mutually distinct values of i, j, k and /. 

In summary, the partitioning of Equation 21 permits us to express the computation of 
matrix R of the QR factorization as two smaller QR factorizations of Equations 22 and 24 
and a sequence of premultiplications with the fast Givens transformations associated with Qj 
according to Equation 23. If matrix Q is not desired, as may be the case when using the 
factorization as part of a linear system solver, Qi and Q2 do not have to be computed 
explicitly. If the Q matrix is desired it can be computed by means of postmultiplications 
using a partitioning along rows instead of columns. 



Decoupling 

Our decoupled design for the QR factorization is based on three systolic algorithms, each 
using an i?xi? array of compute processors. The first algorithm performs a systolic Givens 
computation by triangularizing an M x i? matrix, where M > R] cf. Equation 22 and, 
analogously. Equation 24. The systolic Givens computation produces a sequence of fast 
Givens transformations G{i^j)^ each of which is represented by the pair [aij^^ij)^ and the 
corresponding values of diagonal matrix D. The second algorithm implements the update 
operation of Equation 23 as a systolic premultiplication of an M x i? matrix. The 
third algorithm computes matrix Q by means of systolic postmultiplications according 
to Equations 25-27 on i? x M matrices. 

Figures 10 and 11 illustrate the systolic Givens computation for triangularizing an M x i? 
matrix A, where M > R. Columns of matrix A enter the array at the top. In addition, 
the elements of diagonal matrix D enter the array by interleaving them with the leftmost 
column of matrix A. The resulting upper-triangular matrix R leaves the array at the bottom. 
Furthermore, the Givens computation yields a sequence of pairs {aij^/3ij) and a sequence of 

values 5i = d^ that leave the array on the right. In addition, intermediate values^ of 

(3) 
Si leave the array at the bottom of processor p^; for example d\ in time step 25. The 

diagonal processors pjj of the array compute the sequence of pairs {aij^/3ij) for the entire 

column j, that is for all rows i with j < i < M. In addition, these processors compute value 



^We denote the sequence of n updates of diagonal element di as: di, d\ \ d\ \ d\ , 



. ., d\^, and, finally, 
compute 5i = l/\/d\^\ Similarly, upper triangular element rij {i < j) evolves from aij via a sequence of n 



intermediate values: a^j, u\j\ u\j\ 
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Figure 10: Systolic Givens computation for an M x R matrix, where M > R. We show 
the triangularization of a 4 x 3 matrix A, such that R = D'^l'^ (^(4, 3)^ G{A, 2)'^ (^(3, 2)'^ 
G'(4, 1)"^ (^(S, 1)"^ G{2, lY • A. The fast Givens transformation G{i,j) is represented by the 
pair {aij,l3ij), and 5i = d~ ' . [continued in Figure 11] 
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Figure 11: Continuation of Figure 10. 
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7zj = —aij/3ij^ which is used to update elements df^ ^ = {l+jij)dl^^ and dj ^ = {l+jij)dj 



of diagonal matrix D. After all updates are applied, we compute 5i = l/yd^^K The updates 
of the diagonal elements are scheduled on either the diagonal or subdiagonal processors of 
the array. 

The systolic Givens computation involves a relatively large number of operations even 
for the small 4x3 example in Figures 10 and 11. Therefore, rather then discussing the 
computation of a particular element, we describe the data flow through the array at a higher 
level. As mentioned before, the processors on the diagonal of the array are responsible 
for computing the fast Givens transformations. In particular, processor pu computes the 
transformations 6^(2, 1) during time steps 4-8, 6^(3, 1) during time steps 9-13, and 6^(4, 1) 
during time steps 14-18. For example, transformation 6^(2, 1) involves computing a2i in time 
step 4, /321 in time step 5, 721 in time step 6, and updating the diagonal elements di and ^2 
during time step 8 on the diagonal and subdiagonal processors pu and P21. Analogously, the 
fast Givens transformations 6^(3, 2), and 6^(4, 2) are computed on processor ^22 with support 
of ps2 during time steps 12-21, and 6^(4,3) on processor ^33 during time steps 20-25. 

The systolic Givens computation in Figures 10 and 11 produces an upper triangular 
3x3 matrix by computing the premultiplications with the fast Givens transformations. All 
updates due to premultiplications occur on the upper triangular processors of the array. 
They generate the elements Tij {i < j) via a sequence of intermediate values: a^j, uij\ uij\ 

. . ., ul^\ Tij — Siulj . Recall that the updates of diagonal element 5i involve multiplications 

with (l+jij) for i > j and with (1+7^^) for i < j. The systolic Givens computation according 

to Equation 22 will therefore generate intermediate values of ^^, that require further updates 

(3) 
when computing Equation 24. In Figure 11, only one intermediate value, ^4 , is produced, 

which leaves the array at the bottom of ^31 at time step 25. It will enter the array again, as 

explained during the discussion of Figure 16 below. 

Figures 12 and 13 illustrate the systolic premultiplication, which updates matrix A ac- 
cording to Equation 23. This update uses the pair-representation of the Givens transfor- 
mations, so that matrix Qj does not have to be computed explicitly. The systolic array 
produces the R x R matrix R^ and the (M — R) x R matrix {A22 ^32)^ of Equation 23. 
Columns of matrix A enter the array at the top, and the pairs (a^j, /3ij) of fast Givens trans- 
formation G{i^j) as well as the diagonal elements of matrix Z)~^/^ enter the array from 
the right. Processor pij computes element Tij of matrix R12. The elements of matrices R12 
and {A22 ^32)^ leave the array at the bottom such that the values of {A22 ^32)^ precede 
those of R12. 

Compared to the systolic Givens computation, the data flow through the array in Fig- 
ures 12 and 13 is relatively straightforward. We show the systolic premultiplication of a 4 x 3 
matrix A. According to Equation 23, the premultiplication for this particular example is: 



= D-'/^G{4, 3)^G(4, 2)^G(3, 2)^G(4, 1)^G(3, 1)^G(2, 1) 



where the multiplications are executed from right to left. For example, consider the compu- 
tation of element rn. Its value is determined by the premultiplications with 6^(2, 1), 6^(3, 1), 
and 6^(4, 1), generating a sequence of intermediate values an, Uii\ Uii\ Uii\ and flnally rn: 
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Figure 12: Systolic premultiplication for computing R and update A according to Equa- 
tion 23. [continued in Figure 13] 
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Figure 13: Continuation of Figure 12. 
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nS = an + hi ■ ^21, «i? = u^i + /^si • asi, nj? = u^S + Ai • a,,, m = S, ■ ng\ In Figures 12 
and 13, each of these updates is computed by processor pu. The values of the first column 
of A enter pu from the top. The (a, /3)-pairs associated with the fast Givens transformations 
G{2, 1), ^(3, 1), and ^(4, 1) enter pu from the right. Values a2i and /32i are available for 
the first update at pu during time step 6, resulting in Uii . The second update occurs at 
time step 8, and the third at time step 10. At time step 11, diagonal element ^i arrives at 
processor pn, resulting in the computation of rn. Thereafter, rn travels downwards through 
the array, one processor per time step, until it leaves the array at the bottom of processor psi 
at time step 14. 

The computations of elements a[j ioi i > R are analogous to those of the rij. For example, 

the computation of a4^ produces the sequence of intermediate values: a4i, a\i — a4i + a4iaii, 
a\i — a\i +Q^42^2i\ ^41 — ^41 +<^43^3i • The corresponding updates occur at time step 9 on 
processor pn, time step 10 on processor P21, and time step 11 on processor ^31. Element a'^^^ 
leaves the array at the bottom of processor ^31 at time step 12. 

Figures 14 and 15 illustrate the systolic postmultiplication, which computes matrix Q 
of the QR factorization according to Equation 27 by postmultiplying the sequence of fast 
Givens transformations into the identity matrix. Here, we denote intermediate values of G 
as Q' according to 

(R M \ 

n n G{i,j)\^b-'i' 
j = l Z=J + 1 / 

where Qu is an i? x i? submatrix of Q, Q'12 ^^d Q[^ are RxR matrices of intermediate values, 
and / is an i? X i? identity matrix. The RxR array of compute processors in Figures 14 
and 15 generates an i? x M block of rows at a time, where M > R. Matrix G = (gij), 
which is initially the M x M identity matrix, enters the processor array at the top. The 
fast Givens transformations are represented by their (a, /3)-pairs, and enter the array from 
the right. The diagonal elements of matrix Z)~^/^ enter the array from the right following 
the fast Givens transformations. Matrices Qu, (5i2 and Q[^ leave the array at the bottom, 
such that processor pjn emits row i of matrix {Qu Q[2 Q'u)^ and the values of Q'^^ and Q[^ 
precede those of Qw 

Processor pij of the array computes element Qij of Qw For example, consider the com- 
putation of element ^n, which results from postmultiplications with 6^(2,1), 6^(3,1), and 
6^(4,1). Starting with 5^11, processor pu generates the sequence of intermediate values: 

fi'ii^ = 9ii + 1^21912^ Qu = fi'ii^ + /^sifi'is, 9ii = 9ii + I^ai9ia^ and finally qu = ^i9ii - The 
first update occurs on processor pu at time step 6, the second at time step 8, the third at 
time step 10, and the computation of qu at time step 11. The intermediate values g\j ^ for 

("3) ("3) 

j > R are computed before the elements of Qw In Figure 15, these are the values gi^\ 5^24 , 

(3) 
and 5^34 . We note that the number of updates involving zero-elements of the initial matrix 

G = / is asymptotically insignificant compared to the total number of updates. Hence, we 
leave the updates involving zero-elements in the computation to keep the systolic array as 
simple as possible. 

We are now ready to compose our three systolic algorithms for fast Givens computa- 
tion, premultiplication, and postmultiplication, and discuss the decoupled data flow of the 
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Figure 14: Systolic postmultiplication for computing an i? x M block of Q 
IG{2, 1)G{3, 1)G'(4, 1)^(3, 2)^(4, 2)^(4, 3)L»-V2. [continued in Figure 15] 
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Figure 15: Continuation of Figure 14. 
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complete QR factorization. Figure 16 shows the twelve systolic computations needed to 
compute both Q and i? of an M x A^ matrix Aioi M — 3R and N = 2R on an i? x i? array 
of compute processors. In addition, 3R memory processors are required on the periphery 
of the compute array. We use two fast Givens computations to compute i?ii, i?22, and the 
corresponding fast Givens transformations in phase 1 according to Equation 22 and phase 3 
according to Equation 24, respectively. Phase 2 implements the premultiplication according 
to Equation 23. The remaining phases use the postmultiplication to compute matrix Q. In 
particular, matrix Qi of Equation 25 is computed during phases 4-6, one R x M row block 
per phase, and matrix Q is computed during phases 7-12. 

The fast Givens computation of phase 3 is reflected about the horizontal axis, when 
compared to our presentation in Figures 10 and 11. This reflected version of the systolic 
algorithm allows us to stream the results of the premultiplication from phase 2 right back 
into the array. Similarly, we use reflected versions of the postmultiplication during phases 7- 
9. We note that the computation of R is flnished after phase 3. Thus, for linear system 
solvers that do not require knowledge of Q, we could stop the computation after these three 
phases. Phases 7-12 for the computation of Q deserve further discussion, because our imple- 
mentation deviates slightly from Equations 25-27. Rather than computing Q2 explicitly, we 
postmultiply Qi with the corresponding fast Givens transformations directly. However, these 
postmultiplications apply to the right-most M x (M — R) submatrix of Qi only, as shown in 
phases 7-9. Finally, the right-most column block (Q'l^ Q^'^ Qzz)^ of the intermediate matrix 
must be multiplied by D~^/^ according to Equation 26. This multiplication is implemented 
in phases 10-12. Matrix Q is now available in parts on the memory processors at the top 
and in parts at the bottom of the machine. 

Analogous to our other stream algorithms, the partitioning due to Equations 22-26 pro- 
duces a set of heterogeneous subproblems. We reduce the QR factorization of size M x N 
until the subproblems can be computed on an i? x i? array of compute processors. Figure 16 
represents the data movement when computing the systolic subproblems of Equations 22- 
27. We use R memory processors at the top of the array to buflFer the columns of A and 
intermediate results, another R at the bottom of the array for the rows of Q and columns 
of i?, and R more to the right of the array for storing Givens transformations. Thus, our de- 
coupled systolic QR factorization requires P — R^ compute processors and M — 3R memory 
processors. We observe that M = o{P) and thus our QR factorization is decoupling eflBcient. 

Efficiency Analysis 

In the following, we analyze the eflBciency of computing R and Q of an M x A^ matrix A 
separately, because the computation of Q is optional. To handle the general case where 
M > A^, we introduce two cr-values aM = M/R and a^ = N/R with network size R. We 
approximate the number of multiply-and-add operations for computing matrix R by means 
of fast Givens transformations as Cr{M, N) ^ N'^{M — N/3). This approximation neglects 
an asymptotically insigniflcant quadratic term that includes division operations and a linear 
term including square-root operations. 

We begin by analyzing the number of time steps for the computation of matrix R. We 
apply the methodology used for the lower-triangular solver and the LU factorization, by 
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Figure 16: Phases of a stream-structured QR factorization on an i? x i? array of compute 
processors, with M = 3i? and N = 2R. In phase 1, we triangularize the left-most N/2 
columns of A^ producing Ru and a set of fast Givens transformations as expressed in Equa- 
tion 22. In phase 2, we apply the Givens transformations to the right-most N/2 columns 
of A giving us i?i2, ^22^ ^^d A32 according to Equation 23. In phase 3, we triangularize 
the right-most columns of A according to Equation 24. Phases 4-6 compute Qi according 
to Equation 25, while phases 7-12 compute Q = Q1Q2 according to Equation 27. Note 
that phases 3 and 7-12 can be modified without loss of eflBciency, so that matrix Q is not 
distributed across the top and bottom rows of memory processors, but would be available 
on one side only. 
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partitioning the problem recursively until matrix A consists of aM x a^ blocks of size Rx R. 
Equation 28 illustrates the factorization for aM = 5 and a^ = 4. 



(28) 



To facilitate the efficiency analysis, we discuss our block-iterative schedule in detail. Our 
partitioning in Equations 22-24 extends to the 5x4 case as follows. The factorization sweeps 
across pivot blocks from top left to bottom right. For each pivot block, we triangularize and 
update. First, we annihilate all lower-triangular elements in the pivot column using a systolic 
fast Givens computation. Then, we update the column blocks to the right of the pivot 
column by means of premultiplications, cf. phases 1 and 2 in Figure 16. In our example of 
Equation 28, we first annihilate all lower-triangular elements in column block An. Second, we 
premultiply the 5x3 block matrix consisting of column blocks {Ai2 Ais ^^4) for 1 < i < 5. We 
apply our systolic premultiplication once to each of these column blocks separately. We then 
proceed with triangularizing column block Ai2. The subsequent premultiplication effects the 
matrix (^^3 Ai^) for 2 < i < 5, that is excluding the top row. Similarly, after triangularizing 
column block ^^3, the subsequent premultiplication effects matrix (A44 ^54)^ only. Figure 17 
shows the areas of matrix A that are effected by the fast Givens computation (a) and the 
premultiplication (b) when handling column block i. 
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Figure 17: (a) The fast Givens computation of Equation 22 (or Equation 24) for column 
block i effects the bottom (aM — i + l)R rows, (b) After triangularizing column block i, we 
update the {aM — i + l)R x (a^ — i)R submatrix of A according to Equation 23. 

We now consider the general case where A is a aM x cfn matrix. The behavior of the 
systolic fast Givens computation shown in Figures 10 and 11 is as follows. The critical path 
is determined by the computations of the processors on the diagonal of the array. Processor 
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Pkk requires 5 time steps to compute a, /3, 7, and intermediate values of d^ and u^k for 
each Givens transformation. Thus, the fast Givens transformation of column block i uses 
5(crM — i + i)R time steps. In addition, starting up the pipeline takes 5R time steps and 
draining lOR time steps, resulting in a total of 5{aM — i + l)R + 15i? time steps. 

The number of time steps for the premultiplications associated with column block i can 
be counted as follows. The systolic premultiplication requires 2 time steps per output value 
according to Figures 12 and 13. For {a^ — i) column blocks and with {aM — i + 1) row blocks 
each, the premultiplication takes {a^ — i) * 2(crM — i + i)R time steps. In addition, starting 
the pipeline requires 2R time steps and draining AR time steps for a total of 2{aN — i){crM — 
i + 1)R + 6R time steps. 

The number of time steps for computing upper-triangular matrix R of the stream- 
structured QR factorization is summed up as follows. For a aM x a^ block matrix A 
with block size i? x i? on a network of size i?, where the postmultiplication can overlap the 
preceding fast Givens computation by R time steps, we have 



aN 



Tr{cJM,(JN^R) ^ Y.5{aM-i + l)R+l5R 



i=l 

+ ^ 2{aN-i){(JM-i + l)R + QR 

i=l 

(7N-1 
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0/2 1 3 . 3 2 101 ,, 

o 2 o 

Using a network of size R^ P — R^ compute processors, and M — 3R memory proces- 
sors, the floating-point eflBciency of the computation of upper triangular matrix R of the 
QR factorization is 



Er{cfM,(^N, R) 



C^N^M — ^CTn R 



c^N^M - |cr^ + 4:aMcrN - fc^AT + ^^N "5 R + 3' 



Asymptotically, that is for large network size i?, aM^ and cr^v, the compute eflBciency ap- 
proaches the optimal value of 100%. As for the triangular-solver and LU factorization, 
when (Jm^cfn ^ 1 we have Er{R) ~ i?/(i? + 3), and we achieve more than 90% eflBciency 
for R > 27. 

We now turn to the eflBciency analysis of the computation of matrix Q of the QR factor- 
ization. The number of multiply-and-add operations is Cq{M, N) = 2M'^N - MN - MN^. 
We use a block-iterative schedule to overlap and pipeline the systolic postmultiplications. 
We partition the problem recursively until Q is a gm x cfm block matrix, and each block is 
di Rx R matrix. The postmultiplications associated with column block i of Q also eflFect the 
all column blocks i + 1, ... , gm to the right of i, as shown in Figure 18. We apply our sys- 
tolic postmultiplication to individual row blocks. Thus, the computation of column block i 
applies aM systolic postmultiplications to a row block of size R x (aM — i + l)R. According 
to Figures 14 and 15, the computation of each output value requires 2 time steps. Therefore, 
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the number of time steps for postmultiplying gm row blocks associated with column block i 
is 2RaM{cfM — i + 1). With a startup time of the systolic postmultiplication of 2R time 
steps, and a drainage time of AR time steps, the number of time steps for column block i is 
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Figure 18: We compute column block i oi M x M matrix Q by partitioning the computation 
into Gm row blocks, each with R rows and [gm — i + 1)R columns. Each row block is 
computed using a systolic postmultiplication. After the postmultiplication, the hatched area 
of the matrix holds the final values of Q, while the shaded area comprises intermediate 
values. 

We obtain the number of time steps for computing matrix Q of the QR factorization of 
an M X A^ matrix A by summing up the number of time steps for each update of matrix Q. 
Matrix Q is updated once for each set of Givens transformations produced by the systolic 
Givens computation array, and there are a^ sets of Givens transformations. We can save 
2R time steps per column block by overlapping consecutive computations. 
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Using a network of size R^ P — R^ compute processors, and M — 3R memory processors, 
the fioating-point eflBciency of computing matrix Q of the QR-factorization is 
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Asymptotically, that is for large network sizes i?, gm-^ and cr^v, the compute eflBciency of 
computing matrix Q matrix approaches the optimal value of 100 %. When gm-^ ct^v ^ 1, we 
have Eq{R) ^ i?/(i? + 3), and we achieve more than 90 % eflBciency for R > 27. 

The number of time steps and eflBciency of the entire QR factorization involves the 
computation of both R and subsequently Q. The number of multiply-and-add operations 
is C{M,N) = Cr{M,N) + Cq{M,N) ^ 2M^N - N^S, The number of time steps for 
computing R and Q is the sum of the time steps for the individual computations: 

/I 3 125 \ 

Tqr{(JM, CFN, R) ^ R f 2cr^Cr7V - -Cr% + ^GmC^N " -(^N + ~^^^ ~ ^J ' 
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Using a network of size R, P = R^ compute processors, and M = 3R memory processors, 
the floating-point efficiency of our stream-structured QR factorization is 

P (^ ^ o^ ^ '^(^M^N - |t7^ R 

l^,r[<JM,<J^,K) ~ 2al,ar,-la% + 5aM<7j,-la%+'-faj,-3' R + 3- 

For aM = cr^v = 1, the problem reduces to a single systolic QR factorization, and we obtain 
a compute efficiency of 

5 R 

Eqr{cfM = 1, CTtv = 1, i?) ^ — - 



69 i? + 3 

The expressions for execution time and compute efficiency are easier to comprehend when 
considering a square matrix A of dimension N x N. Then, aM = cfn^ and we can express 
the execution time and efficiency as a function of a = aM = ctn and R: 
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We note that for a fixed a, the QR factorization of an A^ x A^ matrix requires T{N) 



f ^^ + 2^ + ^ ~ 3/cr)A^ = 6(A^) time steps with {N/aY compute processors. 



8 Convolution 

The convolution of vector a of length M with vector w of length N produces an output 
vector h of length M -\-N — 1. Without loss of generality, we assume that M > N. Element k 
of b is given by 

h = Zl ^i • ^j (29) 



where 



1 < A; < M + N - 1 
1< i < M 
l< j <N. 



Partitioning 



We partition the convolution into N/R subproblems by partitioning the sum in Equation 29 
as follows: 

N/R 

h = Yl Y ^i' ^j (30) 
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where 

1 < A; < M + R-1 
1< i < M 
{l-l)R+l< j <IR+1. 

This partitioning expresses the convolution of a and w as the sum of convolutions of a with 
N/ R weight vectors Wj. Intuitively, we partition weight vector w into chunks of length i?, 
compute the partial convolutions, and exploit the associativity of the addition to form the 
sum of the partial convolutions when convenient. 

Decoupling 

We use the systolic design of Figure 19 to implement a convolution with N — R. This design 
is independent of the length M of vector a. For the example in Figure 19 we have chosen 
N — R — A and M = 5. Both vector a and weight vector w enter the array from the left, 
and output vector h leaves the array on the right. Compute processor pi is responsible for 
storing element Wi of the weight vector. Thus, the stream of elements Wi folds over on the 
way from left to right through the array. In contrast, vector a streams from left to right 
without folding over. During each time step, the compute processors multiply their local 
value Wi with the element of aj arriving from the left, add the product to an intermediate 
value of hk that is also received from the left, and send the new intermediate value to the 
right. The elements of h leave the array on the right. 



W4 W3 W2 Wi 
^5 ^4 ^3 ^2 ^1 



(1) 



3.A 3-2 



W2 W; 

a? a. 



(5) 



(9) 



b, 







b^ 


bs 






K 


b^ 


Wi 


W2 


W3 


W4 




as 


8483 


82 ai 



W4 W3 W2 Wi 
^5 ^4 ^3 ^2 ^1 



(2) 



(6) 



(10) 



bj 


b^ 


b3 




Wi 


bi 
W2W4 


b^ 
W3 


b2 


8584 


8382 


ai 









b) 


be 


Wi 


W2 


W3 
8584 


b^ 
W4 

8382 



W, W2 

82 a, 


b, 







(3) 



(7) 



85 84 





b^ 


bj 




Wi 


bj 
W2 


b? 
W3W4 


bs 


as 


8483 


8281 





by 









b) 


Wi 


W2 


W3 


W4 






as 


8483 



(11) 



Wi W3 
83 82 



bj 



a, 



b, 



(4) 



(8) 



(12) 





K 


b^ 


b4 






b^ 


b5 


Wi 


W2 


W3 


W4 




8584 


8382 


ai 



Wi 


W2 


W3 


W4 
8584 



^b. 



Figure 19: Systolic convolution of input sequence a^ of length M = 5 with A^ = 4 weights Wj. 
Both the weights and input sequence are fed into the linear array of i? = 4 compute proces- 
sors. Intermediate results are shown above the corresponding processors. Value h\ represents 
an intermediate value of h^ after the first i products have been computed according to Equa- 
tion 29. 

We illustrate the data movement in Figure 19 by discussing the computation of 64 = 
a/^wi + a'^W2 + a2W'^ + aiW4^. We begin with time step 5 in Figure 19 when element a/^ 



39 



enters processor pi on the left. Element Wi is already resident. Processor pi computes the 
intermediate value bl = a^ - Wi^ and sends it to processor p2. At time step 6, p2 receives as 
and 64 from processor pi on the left. With weight W2 already resident, processor p2 computes 
intermediate value bl = bl + as - W2. In time step 7, values 6|, a2, and Ws are available for 
use by processor ps. It computes and sends intermediate value b\ — b\-\- a2 - Ws towards 
processor p^. At time step 8, p^ receives 6|, ai, and w^ from ^3, and computes 64 = bl + ai-w^. 
At time step 9, 64 exits the compute array. 

We use the partitioning of Equation 30 to reduce a convolution with a weight vector 
of length N into N/R systolic convolutions that match network size i? of a linear array of 
compute processors. In addition, we employ one memory processor on the left of the array 
to buffer vectors a and w^ and another memory processor on the right of the array to store 
intermediate values of the computation as well as to compute the sum of the subproblems. 
Figure 20 illustrates the computation of a convolution on a linear processor array. Our 
decoupled systolic convolution requires P — R compute processors and M — 2 memory 
processors. We observe that M — o{P) and, therefore, our convolution is decoupling efficient. 
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Figure 20: Stream convolution of an input sequence of length M = 5 with A^ = 4 weights 
on a linear array of i? = A^/2 = 2 compute processors and M = 2 memory processors. Value 
6/ represents the computation of b^ when the outer summation of Equation 30 has been 
executed / times and the inner summation has been executed i times. Note that the memory 
tile on the right performs an addition to accumulate the results of the partial convolutions. 
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Efficiency Analysis 

The number of multiply-and-add operations in the convolution of a sequence a of length M 
with a weight sequence w of length N is C(M, N) = MN. On a linear network of size R with 
P{R) = R compute processors and M = 2 memory processors, we partition the computation 
into a = N/R subproblems, each of which is a convolution of sequence a of length M with 
weight sequence w of length R. These subproblems overlap perfectly as is obvious from 
Figure 20. 

We account for the time steps of the convolution as follows. There are a = N/R systolic 
convolutions on a linear array of R compute processors that pipeline perfectly. Each of the 
systolic convolutions requires M + R time steps to stream a sequence of length M through 
an array of size i?, and because each processor executes one multiply-and-add operation per 
time step. Due to the perfect overlap of subsequent systolic convolutions, the R time steps 
needed to drain the pipeline are incurred only once. Thus, the number of time steps is: 

Using a linear network of size R consisting oi P — R compute processors and M — 2 memory 
processors, the floating-point eflBciency of the convolution is: 

Given our assumption that M > A^, we have N/M < 1, and the eflBciency of our stream- 
structured convolution approaches the optimal value of 100 % for large values of a and R. 
Thus, for cr ^ 1, we have Econv ~ R/{R + 2) and obtain more than 90% eflBciency for 
R > 18. For A^ = i? or, equivalently cr = 1, the stream convolution reduces to a systolic 
convolution with a compute eflBciency of 

1 /? 



1 + N/M R + 2 

We note that for A^ = M the eflBciency of the systolic convolution has an upper bound of 
50%. Our stream convolution defles this upper bound provided that a is suflBciently large. 

9 Conclusion 

We have studied the feasibility of highly eflBcient computation on tiled architectures. Because 
these architectures are constrained by short wires, we found that systolic computation, which 
considers both architecture and algorithms simultaneously, proves invaluable even with to- 
day's microtechnology We developed stream algorithms together with a decoupled systolic 
architecture as a means to exploit the similarity between systolic arrays and tiled architec- 
tures. In addition, we discovered that for many regular applications decoupling memory 
accesses from computation allows us to move load and store operations oflF the critical path. 
The resulting reduction of the critical-path length enables us to increase eflBciency by in- 
creasing the number of processors, and even approach 100 % compute eflBciency in the limit. 
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Unlike systolic arrays, our decoupled systolic architecture allows us to execute programmed 
stream algorithms of arbitrary problem size on a constant-sized machine. In contrast to 
contemporary parallel RISC architectures, our decoupled systolic architecture enables us to 
increase efficiency by increasing the number of processors. 
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Table 1: Summary of stream algorithms. We show the number of compute processors P, 
number of memory processors M, and the bounded amount of storage for local variables 
per compute processor S. In addition, we compare the execution time T and compute 
efficiency E. 

We presented five concrete examples of stream algorithms for a matrix multiplication, 
a triangular solver, an LU factorization, a QR factorization, and a convolution. Table 1 
summarizes our results for these applications. For each of our stream algorithms, the table 
lists the number of compute processors P and memory processors M as a function of network 
size R. We also show the amount of local state S as the number of scalar variables that each 
compute processor must maintain. Consider the matrix multiplication, for example. Each 
compute processor must store locally the values of problem size N ^ network size i?, its row 
and column indices within the compute array, the intermediate value of the matrix element 
being computed, a computed product element that is streamed towards the output, and 
three loop variables. Note that the space requirements for the local state on the compute 
processors is bounded for each of our stream algorithms by a reasonably small number. 
Table 1 also compares the execution times T and efficiencies E of our stream algorithms. 

Our experience with the design of stream algorithms has revealed three noteworthy in- 
sights. First of all, our design philosophy for stream algorithms appears to be quite versatile. 
We were able to formulate stream algorithms even for relatively complex algorithms like the 
QR factorization. Secondly, our stream algorithms achieve 100 % compute efficiency where 
conventional systolic designs cannot. The astute reader may have noticed that none of 
our systolic algorithms achieves 100 % efficiency. For an infinitely large network size i?, 
the efficiency of our systolic matrix multiplication is E'^m = 1/3? for the triangular solver 
Eits — 1/6, for the LU factorization Eiu — 1/12, for the QR factorization Eq^ — 5/69, and for 
the convolution Econv = V(l + N/M). However, all of our stream algorithms are compute 
efficient. Thirdly, our design philosophy for stream algorithms emphasizes the amortization 
of inefficient systolic computations by means of efficient ones. For example, in the QR fac- 
torization, the systolic algorithm for the Givens computation is quite inefficient, because 
only the diagonal processors of the compute array are fully utilized. However, the num- 
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ber of multiply-and-add operations required by the QR factorization is not dominated by 
the systolic Givens computation, but rather by the systolic premultiplication and postmul- 
tiplication, both of which are highly efficient, because we can pipeline a large number of 
problems. When partitioning a large problem into systolic subproblems, we focus our efforts 
on identifying and optimizing those critical subproblems which dominate the computational 
effort. 

In summary, we believe that stream algorithms provide an excellent match for the physical 
and technological constraints of future single-chip tiled microarchitectures. 
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